
2  Estimating the five fairness premiums

Objectives

Building on the datasets simulated in sec-simul-dataset, this section estimates the spectrum of fairness. It complements Section 4.2 of the main paper, linked from the supplement’s introduction. We recommend keeping the main paper nearby, as this supplement focuses exclusively on implementation.

Packages for this section
library(tidyverse)
library(latex2exp)
library(jsonlite)

## Also set up Python for optimal transport.
library(reticulate)
get_python_env_path <- function(file_path = "python_env_path.txt") {
  if (!file.exists(file_path)) {
    stop("The file 'python_env_path.txt' is missing. Please create it and specify your Python environment path.")
  }
  
  # Read the file and extract the first non-comment, non-empty line
  env_path <- readLines(file_path, warn = FALSE)
  env_path <- env_path[!grepl("^#", env_path) & nzchar(env_path)]
  
  if (length(env_path) == 0) {
    stop("No valid Python environment path found in 'python_env_path.txt'. Please enter your path.")
  }
  
  return(trimws(env_path[1]))
}
tryCatch({
  python_env_path <- get_python_env_path()
  use_python(python_env_path, required = TRUE)
  message("Python environment successfully set to: ", python_env_path)
}, error = function(e) {
  stop(e$message)
})
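Once the environment is set, it can help to confirm that the optimal-transport dependency is importable before moving on. A minimal check, assuming the Python package is POT (imported as `ot`); adjust the module name if your setup differs:

```r
## Sanity check: confirm the optimal-transport module is importable.
## We assume the Python package is POT (module name `ot`) -- an
## assumption; replace with whatever module your scripts import.
if (!py_module_available("ot")) {
  stop("The 'ot' module (POT) is not available in the selected Python environment.")
}
```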
Data for this section
sims <- fromJSON('simuls/train_scenarios.json')
valid <- fromJSON('simuls/valid_scenarios.json')
test <- fromJSON('simuls/test_scenarios.json')

## For the experiment in section 5
sim_samples <- jsonlite::fromJSON('simuls/sim_study.json')
Functions and objects from past sections
levels_for_premiums <- c("mu_B", "mu_U", "mu_A", "mu_H", "mu_C")
labels_for_premiums <- c("$\\widehat{\\mu}^B$","$\\widehat{\\mu}^U$", "$\\widehat{\\mu}^A$", "$\\widehat{\\mu}^H$", "$\\widehat{\\mu}^C$")

## four ingredients for the 5 families of premiums
premium_generator <- function(best, pdx, maps_to_corr, marg){
  list("mu_B" = function(x1, x2, d){
    
    ## simple call of the best estimate model
    best(data.frame(X1 = x1,
                    X2 = x2,
                    D = d))
  }, "mu_U" = function(x1, x2, d = NULL){
    
    ## Explicit inference : mu_U = E(mu_B|X)
    tab_best <- sapply(0:1, function(dl){
      best(data.frame(X1 = x1,
                    X2 = x2,
                    D = rep(dl, length(x1)))) 
      }) 
    tab_pdx <- sapply(0:1, function(dl){
      pdx(data.frame(X1 = x1,
                    X2 = x2,
                    D = rep(dl, length(x1))))
      })
    
    (tab_best * tab_pdx) %>% apply(1, sum)
  }, "mu_A" = function(x1, x2, d = NULL){
    
    ## mu_A = E(mu_B) unconditional
    sapply(0:1, function(d){best(data.frame(X1 = x1,
                    X2 = x2,
                    D = rep(d, length(x1))))*
       marg[d + 1]}) %>% apply(1, sum)
  }, "mu_H" = function(x1, x2, d = NULL){
    
    ## explicit inference of mu_C : mu_H = E(mu_C|X)
    tab_corr <- sapply(0:1, function(dl){
      sb <- best(data.frame(X1 = x1,
                    X2 = x2,
                    D = rep(dl, length(x1))))
      maps_to_corr(data.frame(mu_B = sb, D = dl))
      }) 
    tab_pdx <- sapply(0:1, function(dl){
      pdx(data.frame(X1 = x1,
                    X2 = x2,
                    D = rep(dl, length(x1))))
      })
    
    (tab_corr * tab_pdx) %>% apply(1, sum)
  }, "mu_C" = function(x1, x2, d = NULL){
    
    ## mu_C = T^{d}(mu_B(x, d))
    mu_b <- best(data.frame(X1 = x1,
                    X2 = x2,
                    D = d))
    maps_to_corr(data.frame(mu_B = mu_b, D = d))
  })
}
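As a usage sketch, the generator can be exercised with toy ingredient functions. The linear `best`, constant `pdx`, shift `maps_to_corr`, and marginal `marg` below are illustrative stand-ins, not the fitted models used later (inputs are vectors of length at least two, as the `sapply`-based branches expect matrix output):

```r
## Toy ingredients (illustrative stand-ins, not the fitted models)
toy_best <- function(df) 10 + 2 * df$X1 + df$X2 + 3 * df$D
toy_pdx  <- function(df) ifelse(df$D == 1, 0.4, 0.6)   # stand-in for P(D = d | x)
toy_map  <- function(df) df$mu_B - 1.5 * (df$D - 0.5)  # stand-in transport map
toy_marg <- c(0.6, 0.4)                                # P(D = 0), P(D = 1)

premiums <- premium_generator(best = toy_best, pdx = toy_pdx,
                              maps_to_corr = toy_map, marg = toy_marg)
premiums$mu_B(x1 = c(1, 2), x2 = c(0, 3), d = c(1, 0))  # 15, 17
premiums$mu_U(x1 = c(1, 2), x2 = c(0, 3))               # propensity-weighted average
```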

levels_for_quants <- c('proxy_vuln', 'risk_spread', 'fair_range', 'parity_cost')


quant_generator <- function(premiums){
  list('proxy_vuln' = function(x1, x2, d){
    premiums$mu_U(x1 = x1, x2 = x2) - 
      premiums$mu_A(x1 = x1, x2 = x2)
  },
  'risk_spread' = function(x1, x2, x3, d){
    to_minmax <- data.frame(risk1 = premiums$mu_B(x1 = x1,
                                              x2 = x2, 
                                              d = 1),
                            risk0 = premiums$mu_B(x1 = x1,
                                              x2 = x2, 
                                              d = 0))
    apply(to_minmax, 1, max) - apply(to_minmax, 1, min) 
  },
  'fair_range' = function(x1, x2, d){
    to_minmax <- data.frame(mu_b = premiums$mu_B(x1 = x1, x2 = x2, 
                                           d = d),
                           mu_u = premiums$mu_U(x1 = x1, x2 = x2, 
                                          d = NULL),
                           mu_a = premiums$mu_A(x1 = x1, x2 = x2, 
                                          d = NULL),
                           mu_h = premiums$mu_H(x1 = x1, x2 = x2,
                                          d = NULL),
                           mu_c = premiums$mu_C(x1 = x1, x2 = x2, 
                                          d = d))
    
    apply(to_minmax, 1, max) - apply(to_minmax, 1, min) 
  },
  'parity_cost' = function(x1, x2, d){
    premiums$mu_C(x1 = x1, x2 = x2,
              d = d) -
      premiums$mu_B(x1 = x1, x2 = x2, 
                d = d)
  })
}
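The four diagnostics can be smoke-tested the same way with toy ingredients (illustrative only; the values below follow from the stand-in functions, not from any fitted model):

```r
## Illustrative check of the quantities with toy ingredients
toy_prem <- premium_generator(
  best = function(df) 10 + 2 * df$X1 + df$X2 + 3 * df$D,
  pdx  = function(df) ifelse(df$D == 1, 0.4, 0.6),
  maps_to_corr = function(df) df$mu_B - 1.5 * (df$D - 0.5),
  marg = c(0.6, 0.4))

quants <- quant_generator(toy_prem)
## With the toy best estimate, mu_B(x, 1) - mu_B(x, 0) = 3 for every profile
quants$risk_spread(x1 = c(1, 2), x2 = c(0, 3))  # 3, 3
```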

the_CAS_colors <- c('#FED35D', '#F89708', '#205ED5', '#142345')

In this section, we provide a detailed example of the methodology for estimating the benchmark premiums from the spectrum of fairness defined in Section 4.1 of the main paper, and it complements the estimation discussed there.

In this paper, premiums are estimated using the lightgbm algorithm of Ke et al. (2017), a package that supports common actuarial distributions (Tweedie, Poisson, Gamma, etc.) and was found to have excellent predictive performance compared to other boosting algorithms in Chevalier and Côté (2024). Given lightgbm’s flexibility, we advocate for careful variable preselection.

We optimize hyperparameters on a validation set to prevent overfitting. Key hyperparameters in lightgbm include the number of trees (num_iterations), regularization terms (lambda_l1, lambda_l2) for complexity control, and subsampling fractions (feature_fraction, bagging_fraction).
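As a hedged sketch of how these hyperparameters enter a lightgbm fit (the actual training logic lives in the sourced helper scripts; `train` and `valid` are assumed data frames with response `Y`, and the optimal number of trees is picked by early stopping on the validation set):

```r
library(lightgbm)

## Minimal sketch, not the helper scripts' implementation
dtrain <- lgb.Dataset(data = as.matrix(train[, c("X1", "X2", "D")]),
                      label = train$Y)
dvalid <- lgb.Dataset.create.valid(dtrain,
                                   data = as.matrix(valid[, c("X1", "X2", "D")]),
                                   label = valid$Y)
params <- list(objective = "regression",   # or "tweedie", "poisson", "gamma"
               learning_rate = 0.01,
               feature_fraction = 0.75,
               bagging_fraction = 0.75,
               bagging_freq = 1,           # required for bagging_fraction to take effect
               max_depth = 5,
               lambda_l1 = 0.5, lambda_l2 = 0.5)
fit <- lgb.train(params = params, data = dtrain,
                 nrounds = 2000,
                 valids = list(valid = dvalid),
                 early_stopping_rounds = 50)  # yields the optimal ntree
```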

The data used to construct prices and the estimated spectrum should align to ensure comparability. In the Case Study of the main paper, we conduct a posteriori benchmarking, as the study period is historical. A posteriori assessment can reveal, after the fact, segments where the model deviated from the fairness pillars. In contrast, conducting benchmarking concurrently with the development of a new pricing model provides an estimate of how well the commercial price aligns with the fairness pillars for future periods, though the true fairness implications of the commercial price will only become evident retrospectively.

Discretizing \(D\)

Subsequent steps involve numerous computations over all possible values of \(D\). When \(D\) is continuous, discretization can improve tractability. The discretized version should be used throughout this supplement, and the discretization function should be kept for future applications. For multivariate \(D\), one approach is to first discretize continuous components, then treat the vector as one categorical variable where each unique combination defines a category. Care should be taken to ensure a manageable number of categories, with sufficient representation in each.
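A minimal sketch of such a discretization for a continuous \(D\) (the breakpoints are illustrative; keeping the function lets new data be mapped to the same bins later):

```r
## Illustrative: bin a continuous D into a manageable number of categories
discretize_d <- function(d, breaks = c(-Inf, 25, 45, 65, Inf)) {
  cut(d, breaks = breaks,
      labels = FALSE,   # integer category codes 1, 2, ...
      right = FALSE)    # left-closed intervals [a, b)
}

## Check that each category has sufficient representation
table(discretize_d(c(18, 30, 50, 80)))
```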

Balancing benchmark premiums

Profit and expense loadings are important, but they must not reintroduce unfairness. To align premium benchmarks with expected profits, we multiply the premiums by a factor \(\delta_j \geq 1~\forall j \in \{B, U, A, H, C\}\) to globally balance all premiums to the level of the commercial price.
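The global balancing factor can be computed as the ratio of the total commercial premium to the total of each benchmark premium. A minimal sketch, assuming a vector of per-policy commercial premiums and one column per benchmark (names and inputs are illustrative):

```r
## Illustrative balancing: scale each benchmark to the commercial total.
## `bench` is a data frame with one column per benchmark (mu_B, ..., mu_C);
## `commercial` is the vector of commercial premiums on the same policies.
balance_premiums <- function(bench, commercial) {
  deltas <- sapply(bench, function(mu) sum(commercial) / sum(mu))
  sweep(bench, 2, deltas, `*`)  # each column now sums to sum(commercial)
}
```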

In what follows, we detail how we estimate each benchmark premium in our framework.

2.1 Best-estimate premium

The best-estimate premium \(\widehat{\mu}^B(\mathbf{x}, d)\) serves as the anchor, so its estimation is crucial for our framework. It provides the best predictor of the response variable \(Y\) when differentiating risks based on \((X, D)\). It can be derived from indicated rates or from data-driven estimates (our approach here), as detailed in Complement 5 of the main paper.

Training the data-driven best-estimate price
source('___lgb_best_estimate.R')

## clean the pred repo
unlink(file.path('preds', "*_best_estimate.json"))

# Define grid for hyperparameter optimization
hyperparameter_grid <- expand.grid(
  learning_rate = c(0.01),
  feature_fraction = c(0.75),
  bagging_fraction = c(0.75),
  max_depth = c(5),
  lambda_l1 = c(0.5),
  lambda_l2 = c(0.5)
)

best_lgb <- setNames(nm = names(sims)) %>% lapply(function(name){
  list_df <- list('train' = sims[[name]],
                  'valid' = valid[[name]],
                  'test' = test[[name]])
  the_best_estimate_lightgbm_fun(list_data = list_df, 
                                 name = name,
                                 hyperparameter_grid = hyperparameter_grid)
})
Best for scenario:  Scenario1  
hyperparam  1 / 1 
Best valid mse: 52.35364  
optimal ntree: 474  
Training time: 7.954376  sec. 
Best for scenario:  Scenario2  
hyperparam  1 / 1 
Best valid mse: 52.10566  
optimal ntree: 594  
Training time: 9.149719  sec. 
Best for scenario:  Scenario3  
hyperparam  1 / 1 
Best valid mse: 52.87658  
optimal ntree: 477  
Training time: 7.268097  sec. 
Training the best-estimate price on experimental data (for later use in section 5)
# Define grid for hyperparameter optimization
hyperparameter_grid_sims <- expand.grid(
    learning_rate = c(0.01),
    bagging_fraction = c(0.75),
    max_depth = c(6),
    lambda_l1 = c(0.5),
    lambda_l2 = c(0.5)
  )

best_sims <- lapply(seq_along(sim_samples), function(idx){
    list_df <- list('train' = sim_samples[[idx]]$train,
                  'valid' = sim_samples[[idx]]$valid,
                  'test' = sim_samples[[idx]]$test)
  the_best_estimate_lightgbm_fun(list_data = list_df,
                                 name = paste0('sim', idx),
                                 hyperparameter_grid = hyperparameter_grid_sims)
})
Best for scenario:  sim1  
hyperparam  1 / 1 
Best valid mse: 60.33479  
optimal ntree: 392  
Training time: 4.949132  sec. 
Best for scenario:  sim2  
hyperparam  1 / 1 
Best valid mse: 50.91685  
optimal ntree: 336  
Training time: 5.07208  sec. 
Best for scenario:  sim3  
hyperparam  1 / 1 
Best valid mse: 51.20826  
optimal ntree: 288  
Training time: 4.947608  sec. 
Best for scenario:  sim4  
hyperparam  1 / 1 
Best valid mse: 56.32536  
optimal ntree: 524  
Training time: 10.76273  sec. 
Best for scenario:  sim5  
hyperparam  1 / 1 
Best valid mse: 54.42696  
optimal ntree: 381  
Training time: 7.418698  sec. 
Best for scenario:  sim6  
hyperparam  1 / 1 
Best valid mse: 53.72137  
optimal ntree: 365  
Training time: 7.3409  sec. 
Best for scenario:  sim7  
hyperparam  1 / 1 
Best valid mse: 60.79705  
optimal ntree: 316  
Training time: 6.016717  sec. 
Best for scenario:  sim8  
hyperparam  1 / 1 
Best valid mse: 59.78302  
optimal ntree: 504  
Training time: 7.785273  sec. 
Best for scenario:  sim9  
hyperparam  1 / 1 
Best valid mse: 53.2293  
optimal ntree: 294  
Training time: 5.121347  sec. 
Best for scenario:  sim10  
hyperparam  1 / 1 
Best valid mse: 60.77194  
optimal ntree: 304  
Training time: 4.905564  sec. 
Best for scenario:  sim11  
hyperparam  1 / 1 
Best valid mse: 49.90746  
optimal ntree: 302  
Training time: 5.065884  sec. 
Best for scenario:  sim12  
hyperparam  1 / 1 
Best valid mse: 50.79596  
optimal ntree: 328  
Training time: 5.070839  sec. 
Best for scenario:  sim13  
hyperparam  1 / 1 
Best valid mse: 51.64612  
optimal ntree: 317  
Training time: 6.545786  sec. 
Best for scenario:  sim14  
hyperparam  1 / 1 
Best valid mse: 60.85369  
optimal ntree: 309  
Training time: 8.622002  sec. 
Best for scenario:  sim15  
hyperparam  1 / 1 
Best valid mse: 57.2588  
optimal ntree: 351  
Training time: 9.125745  sec. 
Best for scenario:  sim16  
hyperparam  1 / 1 
Best valid mse: 57.91143  
optimal ntree: 330  
Training time: 12.01773  sec. 
Best for scenario:  sim17  
hyperparam  1 / 1 
Best valid mse: 55.78472  
optimal ntree: 275  
Training time: 11.12442  sec. 
Best for scenario:  sim18  
hyperparam  1 / 1 
Best valid mse: 58.69825  
optimal ntree: 339  
Training time: 9.822523  sec. 
Best for scenario:  sim19  
hyperparam  1 / 1 
Best valid mse: 58.80992  
optimal ntree: 525  
Training time: 12.27094  sec. 
Best for scenario:  sim20  
hyperparam  1 / 1 
Best valid mse: 49.75454  
optimal ntree: 279  
Training time: 8.071054  sec. 
Best for scenario:  sim21  
hyperparam  1 / 1 
Best valid mse: 50.79592  
optimal ntree: 371  
Training time: 11.29008  sec. 
Best for scenario:  sim22  
hyperparam  1 / 1 
Best valid mse: 51.96514  
optimal ntree: 453  
Training time: 13.96027  sec. 
Best for scenario:  sim23  
hyperparam  1 / 1 
Best valid mse: 55.38553  
optimal ntree: 295  
Training time: 12.30759  sec. 
Best for scenario:  sim24  
hyperparam  1 / 1 
Best valid mse: 51.9121  
optimal ntree: 300  
Training time: 11.14981  sec. 
Best for scenario:  sim25  
hyperparam  1 / 1 
Best valid mse: 56.23475  
optimal ntree: 248  
Training time: 11.48121  sec. 
Best for scenario:  sim26  
hyperparam  1 / 1 
Best valid mse: 51.90556  
optimal ntree: 410  
Training time: 14.10936  sec. 
Best for scenario:  sim27  
hyperparam  1 / 1 
Best valid mse: 58.62825  
optimal ntree: 351  
Training time: 11.06315  sec. 
Best for scenario:  sim28  
hyperparam  1 / 1 
Best valid mse: 51.38316  
optimal ntree: 442  
Training time: 11.50402  sec. 
Best for scenario:  sim29  
hyperparam  1 / 1 
Best valid mse: 57.27832  
optimal ntree: 351  
Training time: 10.31913  sec. 
Best for scenario:  sim30  
hyperparam  1 / 1 
Best valid mse: 65.73161  
optimal ntree: 423  
Training time: 11.42914  sec. 
Best for scenario:  sim31  
hyperparam  1 / 1 
Best valid mse: 52.43089  
optimal ntree: 341  
Training time: 11.5907  sec. 
Best for scenario:  sim32  
hyperparam  1 / 1 
Best valid mse: 48.98963  
optimal ntree: 316  
Training time: 8.861043  sec. 
Best for scenario:  sim33  
hyperparam  1 / 1 
Best valid mse: 57.39815  
optimal ntree: 415  
Training time: 9.765657  sec. 
Best for scenario:  sim34  
hyperparam  1 / 1 
Best valid mse: 51.84479  
optimal ntree: 376  
Training time: 12.60381  sec. 
Best for scenario:  sim35  
hyperparam  1 / 1 
Best valid mse: 59.83096  
optimal ntree: 454  
Training time: 12.87287  sec. 
Best for scenario:  sim36  
hyperparam  1 / 1 
Best valid mse: 53.5909  
optimal ntree: 261  
Training time: 8.490556  sec. 
Best for scenario:  sim37  
hyperparam  1 / 1 
Best valid mse: 55.13199  
optimal ntree: 404  
Training time: 13.977  sec. 
Best for scenario:  sim38  
hyperparam  1 / 1 
Best valid mse: 54.7921  
optimal ntree: 286  
Training time: 7.808697  sec. 
Best for scenario:  sim39  
hyperparam  1 / 1 
Best valid mse: 54.24569  
optimal ntree: 284  
Training time: 7.932821  sec. 
Best for scenario:  sim40  
hyperparam  1 / 1 
Best valid mse: 54.19071  
optimal ntree: 352  
Training time: 11.59254  sec. 
Best for scenario:  sim41  
hyperparam  1 / 1 
Best valid mse: 52.58791  
optimal ntree: 329  
Training time: 11.89847  sec. 
Best for scenario:  sim42  
hyperparam  1 / 1 
Best valid mse: 52.76784  
optimal ntree: 397  
Training time: 10.58336  sec. 
Best for scenario:  sim43  
hyperparam  1 / 1 
Best valid mse: 62.81313  
optimal ntree: 391  
Training time: 11.67223  sec. 
Best for scenario:  sim44  
hyperparam  1 / 1 
Best valid mse: 55.71322  
optimal ntree: 395  
Training time: 10.81585  sec. 
Best for scenario:  sim45  
hyperparam  1 / 1 
Best valid mse: 60.5693  
optimal ntree: 354  
Training time: 5.684685  sec. 
Best for scenario:  sim46  
hyperparam  1 / 1 
Best valid mse: 56.32787  
optimal ntree: 367  
Training time: 5.85745  sec. 
Best for scenario:  sim47  
hyperparam  1 / 1 
Best valid mse: 55.9285  
optimal ntree: 382  
Training time: 5.153053  sec. 
Best for scenario:  sim48  
hyperparam  1 / 1 
Best valid mse: 46.62527  
optimal ntree: 349  
Training time: 5.137555  sec. 
Best for scenario:  sim49  
hyperparam  1 / 1 
Best valid mse: 53.27347  
optimal ntree: 278  
Training time: 4.485639  sec. 
Best for scenario:  sim50  
hyperparam  1 / 1 
Best valid mse: 55.30913  
optimal ntree: 404  
Training time: 5.489235  sec. 
Best for scenario:  sim51  
hyperparam  1 / 1 
Best valid mse: 55.59591  
optimal ntree: 260  
Training time: 4.289922  sec. 
Best for scenario:  sim52  
hyperparam  1 / 1 
Best valid mse: 51.41044  
optimal ntree: 277  
Training time: 4.498037  sec. 
Best for scenario:  sim53  
hyperparam  1 / 1 
Best valid mse: 57.98406  
optimal ntree: 475  
Training time: 6.874064  sec. 
Best for scenario:  sim54  
hyperparam  1 / 1 
Best valid mse: 55.82739  
optimal ntree: 357  
Training time: 5.330815  sec. 
Best for scenario:  sim55  
hyperparam  1 / 1 
Best valid mse: 63.26344  
optimal ntree: 358  
Training time: 5.425838  sec. 
Best for scenario:  sim56  
hyperparam  1 / 1 
Best valid mse: 59.99558  
optimal ntree: 319  
Training time: 4.607765  sec. 
Best for scenario:  sim57  
hyperparam  1 / 1 
Best valid mse: 46.54049  
optimal ntree: 293  
Training time: 4.214027  sec. 
Best for scenario:  sim58  
hyperparam  1 / 1 
Best valid mse: 50.22334  
optimal ntree: 363  
Training time: 5.167843  sec. 
Best for scenario:  sim59  
hyperparam  1 / 1 
Best valid mse: 57.2901  
optimal ntree: 365  
Training time: 6.151565  sec. 
Best for scenario:  sim60  
hyperparam  1 / 1 
Best valid mse: 59.99731  
optimal ntree: 390  
Training time: 6.228045  sec. 
Best for scenario:  sim61  
hyperparam  1 / 1 
Best valid mse: 57.14639  
optimal ntree: 459  
Training time: 7.398513  sec. 
Best for scenario:  sim62  
hyperparam  1 / 1 
Best valid mse: 57.85747  
optimal ntree: 315  
Training time: 5.435315  sec. 
Best for scenario:  sim63  
hyperparam  1 / 1 
Best valid mse: 56.39258  
optimal ntree: 312  
Training time: 4.824038  sec. 
Best for scenario:  sim64  
hyperparam  1 / 1 
Best valid mse: 48.83371  
optimal ntree: 332  
Training time: 4.845143  sec. 
Best for scenario:  sim65  
hyperparam  1 / 1 
Best valid mse: 62.17314  
optimal ntree: 524  
Training time: 6.391987  sec. 
Best for scenario:  sim66  
hyperparam  1 / 1 
Best valid mse: 50.93092  
optimal ntree: 403  
Training time: 5.870877  sec. 
Best for scenario:  sim67  
hyperparam  1 / 1 
Best valid mse: 55.27313  
optimal ntree: 367  
Training time: 7.046694  sec. 
Best for scenario:  sim68  
hyperparam  1 / 1 
Best valid mse: 62.22941  
optimal ntree: 375  
Training time: 6.435948  sec. 
Best for scenario:  sim69  
hyperparam  1 / 1 
Best valid mse: 53.87731  
optimal ntree: 348  
Training time: 5.496983  sec. 
Best for scenario:  sim70  
hyperparam  1 / 1 
Best valid mse: 55.38108  
optimal ntree: 299  
Training time: 5.304265  sec. 
Best for scenario:  sim71  
hyperparam  1 / 1 
Best valid mse: 55.88654  
optimal ntree: 407  
Training time: 7.996598  sec. 
Best for scenario:  sim72  
hyperparam  1 / 1 
Best valid mse: 53.81861  
optimal ntree: 299  
Training time: 5.838899  sec. 
Best for scenario:  sim73  
hyperparam  1 / 1 
Best valid mse: 54.05302  
optimal ntree: 352  
Training time: 5.271713  sec. 
Best for scenario:  sim74  
hyperparam  1 / 1 
Best valid mse: 53.68097  
optimal ntree: 417  
Training time: 6.030551  sec. 
Best for scenario:  sim75  
hyperparam  1 / 1 
Best valid mse: 52.77939  
optimal ntree: 367  
Training time: 5.415834  sec. 
Best for scenario:  sim76  
hyperparam  1 / 1 
Best valid mse: 54.30186  
optimal ntree: 450  
Training time: 6.671667  sec. 
Best for scenario:  sim77  
hyperparam  1 / 1 
Best valid mse: 56.50446  
optimal ntree: 390  
Training time: 5.272126  sec. 
Best for scenario:  sim78  
hyperparam  1 / 1 
Best valid mse: 54.13193  
optimal ntree: 323  
Training time: 4.735928  sec. 
Best for scenario:  sim79  
hyperparam  1 / 1 
Best valid mse: 63.37444  
optimal ntree: 601  
Training time: 8.198827  sec. 
Best for scenario:  sim80  
hyperparam  1 / 1 
Best valid mse: 56.34832  
optimal ntree: 448  
Training time: 7.196346  sec. 
Best for scenario:  sim81  
hyperparam  1 / 1 
Best valid mse: 52.71023  
optimal ntree: 397  
Training time: 6.145195  sec. 
Best for scenario:  sim82  
hyperparam  1 / 1 
Best valid mse: 60.70158  
optimal ntree: 356  
Training time: 6.116415  sec. 
Best for scenario:  sim83  
hyperparam  1 / 1 
Best valid mse: 52.28889  
optimal ntree: 281  
Training time: 4.616353  sec. 
Best for scenario:  sim84  
hyperparam  1 / 1 
Best valid mse: 49.49985  
optimal ntree: 330  
Training time: 4.89299  sec. 
Best for scenario:  sim85  
hyperparam  1 / 1 
Best valid mse: 56.55887  
optimal ntree: 280  
Training time: 4.383255  sec. 
Best for scenario:  sim86  
hyperparam  1 / 1 
Best valid mse: 56.75379  
optimal ntree: 1053  
Training time: 8.421956  sec. 
Best for scenario:  sim87  
hyperparam  1 / 1 
Best valid mse: 56.33146  
optimal ntree: 406  
Training time: 5.084105  sec. 
Best for scenario:  sim88  
hyperparam  1 / 1 
Best valid mse: 53.97764  
optimal ntree: 571  
Training time: 6.098414  sec. 
Best for scenario:  sim89  
hyperparam  1 / 1 
Best valid mse: 62.56728  
optimal ntree: 356  
Training time: 5.415524  sec. 
Best for scenario:  sim90  
hyperparam  1 / 1 
Best valid mse: 55.01317  
optimal ntree: 317  
Training time: 7.696967  sec. 
Best for scenario:  sim91  
hyperparam  1 / 1 
Best valid mse: 56.06621  
optimal ntree: 351  
Training time: 5.708203  sec. 
Best for scenario:  sim92  
hyperparam  1 / 1 
Best valid mse: 54.37888  
optimal ntree: 441  
Training time: 5.556663  sec. 
Best for scenario:  sim93  
hyperparam  1 / 1 
Best valid mse: 59.14325  
optimal ntree: 908  
Training time: 10.33407  sec. 
Best for scenario:  sim94  
hyperparam  1 / 1 
Best valid mse: 55.5352  
optimal ntree: 277  
Training time: 5.831233  sec. 
Best for scenario:  sim95  
hyperparam  1 / 1 
Best valid mse: 51.61785  
optimal ntree: 429  
Training time: 6.02435  sec. 
Best for scenario:  sim96  
hyperparam  1 / 1 
Best valid mse: 57.50831  
optimal ntree: 490  
Training time: 5.285491  sec. 
Best for scenario:  sim97  
hyperparam  1 / 1 
Best valid mse: 49.39495  
optimal ntree: 424  
Training time: 4.934937  sec. 
Best for scenario:  sim98  
hyperparam  1 / 1 
Best valid mse: 58.76595  
optimal ntree: 319  
Training time: 4.34882  sec. 
Best for scenario:  sim99  
hyperparam  1 / 1 
Best valid mse: 56.64473  
optimal ntree: 431  
Training time: 6.017951  sec. 
Best for scenario:  sim100  
hyperparam  1 / 1 
Best valid mse: 50.86109  
optimal ntree: 344  
Training time: 5.003828  sec. 

2.2 Unaware premium

The unaware premium, \(\mu^U(\mathbf{x})\), excludes direct use of \(D\). It is defined as the best predictor of \(\widehat{\mu}^B(\mathbf{X}, D)\) given \(\mathbf{X} = \mathbf{x}\):
\[\begin{equation*} \widehat{\mu}^U(\mathbf{x}) = E\{\widehat{\mu}^B(\mathbf{X}, D) \mid \mathbf{X} = \mathbf{x}\}. \end{equation*}\]
This maintains consistency with the best-estimate premium, whether \(\widehat{\mu}^B\) is data-driven or based on indicated rates (Complement 5 of the main paper). One can use any regression model to predict \(\widehat{\mu}^B(\mathbf{X}, D)\) from \(\mathbf{X}\). The loss function defaults to mean squared error, a reasonable choice for this response variable.

Alternatively (our approach here), one can estimate the propensity \(P(D = d \mid \mathbf{X} = \mathbf{x})\) and explicitly weight the best-estimate premiums by the predicted propensities. This also maintains consistency with the best-estimate premium.

Estimating the propensity function
source('___lgb_pdx_estimate.R') 

## clean the preds repo
unlink(file.path('preds', "*_pdx.json"))

hyperparameter_grid <- expand.grid(
  learning_rate = c(0.01),
  feature_fraction = c(0.75),
  bagging_fraction = c(0.75),
  max_depth = c(5),
  lambda_l1 = c(0.5),
  lambda_l2 = c(0.5)
)

pdx_lgb <- setNames(nm = names(sims)) %>% lapply(function(name){
  list_df <- list('train' = sims[[name]],
                  'valid' = valid[[name]],
                  'test' = test[[name]])
  the_pdx_lightgbm_fun(list_data = list_df, name = name,
                       hyperparameter_grid = hyperparameter_grid)
})
Pdx for scenario:  Scenario1  
hyperparam  1 / 1 
Best valid binary logloss: 0.5817981  
optimal ntree: 360  
Training time: 7.743604  sec. 
Pdx for scenario:  Scenario2  
hyperparam  1 / 1 
Best valid binary logloss: 0.5789044  
optimal ntree: 405  
Training time: 9.595778  sec. 
Pdx for scenario:  Scenario3  
hyperparam  1 / 1 
Best valid binary logloss: 0.6345684  
optimal ntree: 492  
Training time: 12.89847  sec. 
Estimating the propensity function on experimental data (for later use in section 5)
# Define grid for hyperparameter optimization
hyperparameter_grid_sims <- expand.grid(
    learning_rate = c(0.01),
    bagging_fraction = c(0.75),
    max_depth = c(6),
    lambda_l1 = c(0.5),
    lambda_l2 = c(0.5)
  )

pdx_sims <- lapply(seq_along(sim_samples), function(idx){
    list_df <- list('train' = sim_samples[[idx]]$train,
                  'valid' = sim_samples[[idx]]$valid,
                  'test' = sim_samples[[idx]]$test)
  the_pdx_lightgbm_fun(list_data = list_df, name = paste0('sim', idx),
                       hyperparameter_grid = hyperparameter_grid_sims)
})
Pdx for scenario:  sim1  
hyperparam  1 / 1 
Best valid binary logloss: 0.6379993  
optimal ntree: 116  
Training time: 1.948016  sec. 
Pdx for scenario:  sim2  
hyperparam  1 / 1 
Best valid binary logloss: 0.604375  
optimal ntree: 160  
Training time: 2.083842  sec. 
Pdx for scenario:  sim3  
hyperparam  1 / 1 
Best valid binary logloss: 0.5575499  
optimal ntree: 414  
Training time: 3.736413  sec. 
Pdx for scenario:  sim4  
hyperparam  1 / 1 
Best valid binary logloss: 0.5971327  
optimal ntree: 159  
Training time: 2.640875  sec. 
Pdx for scenario:  sim5  
hyperparam  1 / 1 
Best valid binary logloss: 0.5904888  
optimal ntree: 341  
Training time: 4.062044  sec. 
Pdx for scenario:  sim6  
hyperparam  1 / 1 
Best valid binary logloss: 0.6250062  
optimal ntree: 147  
Training time: 2.587351  sec. 
Pdx for scenario:  sim7  
hyperparam  1 / 1 
Best valid binary logloss: 0.6080139  
optimal ntree: 219  
Training time: 2.808161  sec. 
Pdx for scenario:  sim8  
hyperparam  1 / 1 
Best valid binary logloss: 0.5644839  
optimal ntree: 263  
Training time: 3.051121  sec. 
Pdx for scenario:  sim9  
hyperparam  1 / 1 
Best valid binary logloss: 0.6110802  
optimal ntree: 187  
Training time: 2.883613  sec. 
Pdx for scenario:  sim10  
hyperparam  1 / 1 
Best valid binary logloss: 0.6345554  
optimal ntree: 151  
Training time: 2.635355  sec. 
Pdx for scenario:  sim11  
hyperparam  1 / 1 
Best valid binary logloss: 0.5686411  
optimal ntree: 331  
Training time: 3.734863  sec. 
Pdx for scenario:  sim12  
hyperparam  1 / 1 
Best valid binary logloss: 0.6159331  
optimal ntree: 165  
Training time: 2.735916  sec. 
Pdx for scenario:  sim13  
hyperparam  1 / 1 
Best valid binary logloss: 0.5849873  
optimal ntree: 204  
Training time: 3.091587  sec. 
Pdx for scenario:  sim14  
hyperparam  1 / 1 
Best valid binary logloss: 0.5851934  
optimal ntree: 298  
Training time: 3.419677  sec. 
Pdx for scenario:  sim15  
hyperparam  1 / 1 
Best valid binary logloss: 0.5825115  
optimal ntree: 224  
Training time: 3.208649  sec. 
Pdx for scenario:  sim16  
hyperparam  1 / 1 
Best valid binary logloss: 0.6126702  
optimal ntree: 183  
Training time: 3.123254  sec. 
Pdx for scenario:  sim17  
hyperparam  1 / 1 
Best valid binary logloss: 0.5983043  
optimal ntree: 193  
Training time: 5.23823  sec. 
Pdx for scenario:  sim18  
hyperparam  1 / 1 
Best valid binary logloss: 0.5970047  
optimal ntree: 225  
Training time: 3.991795  sec. 
Pdx for scenario:  sim19  
hyperparam  1 / 1 
Best valid binary logloss: 0.6382285  
optimal ntree: 121  
Training time: 2.463771  sec. 
Pdx for scenario:  sim20  
hyperparam  1 / 1 
Best valid binary logloss: 0.6025327  
optimal ntree: 214  
Training time: 3.472324  sec. 
Pdx for scenario:  sim21  
hyperparam  1 / 1 
Best valid binary logloss: 0.5378712  
optimal ntree: 323  
Training time: 3.619435  sec. 
Pdx for scenario:  sim22  
hyperparam  1 / 1 
Best valid binary logloss: 0.5684613  
optimal ntree: 277  
Training time: 3.598076  sec. 
Pdx for scenario:  sim23  
hyperparam  1 / 1 
Best valid binary logloss: 0.6079414  
optimal ntree: 142  
Training time: 2.256139  sec. 
Pdx for scenario:  sim24  
hyperparam  1 / 1 
Best valid binary logloss: 0.614944  
optimal ntree: 157  
Training time: 3.527215  sec. 
Pdx for scenario:  sim25  
hyperparam  1 / 1 
Best valid binary logloss: 0.6057661  
optimal ntree: 229  
Training time: 4.347793  sec. 
Pdx for scenario:  sim26  
hyperparam  1 / 1 
Best valid binary logloss: 0.5835863  
optimal ntree: 217  
Training time: 4.235283  sec. 
Pdx for scenario:  sim27  
hyperparam  1 / 1 
Best valid binary logloss: 0.5684134  
optimal ntree: 326  
Training time: 4.756191  sec. 
Pdx for scenario:  sim28  
hyperparam  1 / 1 
Best valid binary logloss: 0.6112137  
optimal ntree: 235  
Training time: 4.134582  sec. 
Pdx for scenario:  sim29  
hyperparam  1 / 1 
Best valid binary logloss: 0.5780127  
optimal ntree: 317  
Training time: 4.11689  sec. 
Pdx for scenario:  sim30  
hyperparam  1 / 1 
Best valid binary logloss: 0.6275452  
optimal ntree: 131  
Training time: 2.157132  sec. 
Pdx for scenario:  sim31  
hyperparam  1 / 1 
Best valid binary logloss: 0.5804082  
optimal ntree: 181  
Training time: 2.921597  sec. 
Pdx for scenario:  sim32  
hyperparam  1 / 1 
Best valid binary logloss: 0.5763533  
optimal ntree: 351  
Training time: 3.937903  sec. 
Pdx for scenario:  sim33  
hyperparam  1 / 1 
Best valid binary logloss: 0.5950109  
optimal ntree: 204  
Training time: 3.363517  sec. 
Pdx for scenario:  sim34  
hyperparam  1 / 1 
Best valid binary logloss: 0.5986753  
optimal ntree: 192  
Training time: 3.189388  sec. 
Pdx for scenario:  sim35  
hyperparam  1 / 1 
Best valid binary logloss: 0.5919205  
optimal ntree: 630  
Training time: 5.726014  sec. 
Pdx for scenario:  sim36  
hyperparam  1 / 1 
(Abridged training log for the remaining per-scenario propensity models, through sim100, one hyperparameter configuration each: best validation binary log-loss between 0.5357 and 0.6467, optimal number of trees between 116 and 492, and training times of roughly 2 to 10 seconds per scenario.)

2.3 Aware premium

The aware premium, \(\widehat{\mu}^A(\mathbf{x})\), requires knowledge of the marginal distribution of \(D\). The aware price, a particular case of the "discrimination-free" price of Lindholm et al. (2022), is computed as:
\[\begin{equation*} \widehat{\mu}^A(\mathbf{x}) = \sum_{d\in \mathcal{D}} \widehat{\mu}^B(\mathbf{x}, d) \widehat{\Pr}(D = d). \end{equation*}\] As discussed earlier, we assume \(D\) is discrete or discretized. If the training sample is representative of the target population (portfolio, market, or region), the empirical proportions are consistent estimators of \(\Pr(D = d)\). This estimator is also suggested in Section 5 of Lindholm et al. (2022).
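Concretely, the aware premium is a probability-weighted average of the best-estimate premiums over the protected attribute. A minimal Python sketch, with hypothetical premium values and group proportions (not taken from the simulations):

```python
# Hypothetical best-estimate premiums mu_B(x, d) for one profile x, d in {0, 1}
mu_B = {0: 98.0, 1: 112.0}

# Marginal proportions Pr(D = d), e.g. empirical proportions from the training sample
p_D = {0: 0.6, 1: 0.4}

# Aware premium: weight each group's best-estimate premium by Pr(D = d)
mu_A = sum(mu_B[d] * p_D[d] for d in p_D)
```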

Computing the empirical proportions for protected subpopulations
marg_dist <- sims %>% lapply(function(data){
  data$D %>% table %>%  prop.table
})

## for later use
saveRDS(marg_dist, 'preds/the_empirical_proportions.rds')

## Computing the empirical proportions for protected subpopulations on experimental data (for later use in Section 5)
marg_dist_sims <- seq_along(sim_samples) %>% lapply(function(idx){
  sim_samples[[idx]]$train$D %>% table %>%  prop.table
})
Note

This method generalizes the use of \(D\) as a control variable to prevent distortions in the estimated effect of \(\mathbf{X}\) due to omitted-variable bias. It applies to all model types and has a causal interpretation under the assumptions that: (i) possible values of \(\mathbf{X}\) are observed across all groups of \(D\) (no extrapolation), (ii) \(D\) causes both \(\mathbf{X}\) and \(Y\), and (iii) there are no relevant unobserved variables. Under the same assumptions, other methods also yield estimators of the aware premium. One example is inverse probability weighting, which re-weights observations to (artificially) remove the association between \(\mathbf{X}\) and \(D\), preventing proxy effects.
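That re-weighting idea can be illustrated with a toy sketch (hypothetical propensities and marginals, not the paper's estimates): stabilized inverse probability weights divide the marginal probability of an observation's group by its propensity given \(\mathbf{x}\).

```python
import numpy as np

# Hypothetical estimated propensities Pr(D = d_i | x_i) for three observations
p_d_given_x = np.array([0.8, 0.4, 0.5])
# Hypothetical marginal probabilities Pr(D = d_i)
p_d = np.array([0.5, 0.5, 0.5])

# Stabilized weights: up-weight observations whose group is rare at their x,
# so that X and D are unassociated in the re-weighted sample
w = p_d / p_d_given_x
```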

2.4 Corrective premium

We estimate the corrective premium, \(\widehat{\mu}^C(\mathbf{x}, d)\), designed to enforce strong demographic parity by aligning premium distributions across protected groups. Various methods can achieve this, but we recommend the optimal transport approach of Hu, Ratz, and Charpentier (2024). Its advantage lies in modifying \(\widehat{\mu}^B(\mathbf{x}, d)\) while keeping the overall spectrum estimation anchored to the initial best-estimate premium. As an incremental adjustment, it minimizes modeling complexity while enforcing demographic parity, yielding a simple estimator for the corrective premium.
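In one dimension, the Wasserstein barycenter admits a closed form: it is the weighted average of the group-wise quantile functions. A toy numpy sketch of this idea (hypothetical premiums; not the EquiPy implementation used below):

```python
import numpy as np

# Toy best-estimate premiums per protected group (hypothetical values)
prem = {0: np.array([90.0, 100.0, 110.0]),
        1: np.array([100.0, 120.0, 140.0])}
w = {0: 0.5, 1: 0.5}  # group proportions Pr(D = d)

# 1-D Wasserstein barycenter: weighted average of the quantile functions
qs = np.linspace(0.0, 1.0, 101)
bary_q = sum(w[d] * np.quantile(prem[d], qs) for d in prem)

# The corrective premium replaces each policyholder's premium by the
# barycenter quantile at the rank it occupies within its own group,
# so all groups share the same premium distribution.
```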

The corrective premium is computed by training a transport function that shifts best-estimate premiums for each protected group toward a common barycenter, ensuring demographic parity through corrective direct discrimination. See Hu, Ratz, and Charpentier (2024) for details. This optimal transport method is implemented in Python via the EquiPy package of Fernandes Machado et al. (2025). Listing 2.1 illustrates how this method, despite its complexity, translates into concise Python code.

Listing 2.1: Illustrative Python code for Wasserstein transport toward corrective premiums.
import numpy as np
import pandas as pd
from equipy.fairness import FairWasserstein

# Load best-estimate premiums and associated sensitive attributes
best_est_train = pd.read_csv('best_est_prem_train.csv')  # Training premiums
sens_train = pd.read_csv('sensitive_train.csv')  # discrete sensitive data

# Train Fair Wasserstein transport to adjust premiums
barycenter = FairWasserstein()
barycenter.fit(best_est_train.values, sens_train.values)

# Load best-estimate premiums and associated sensitive attributes for the test set
best_est_test = pd.read_csv('best_est_prem_test.csv')  # Test premiums
sens_test = pd.read_csv('sensitive_test.csv')  # discrete sensitive data

# Apply transformation to obtain fairness-adjusted premiums
corrective_premiums_test = barycenter.transform(best_est_test.values, sens_test.values)

# Save results
pd.DataFrame(corrective_premiums_test).to_csv('corr_prem_test.csv', index=False)

Computing the optimal transport mapping for the scenarios
source_python("___opt_transp.py")

## now, the folder 'transported' exists, with one epsilon_y file per scenario

To predict the corrective premium on other datasets, one can alternatively train a LightGBM model that maps \((\mathbf{X}, D)\) to the corrective premiums produced by EquiPy on the training sample.

Training the LightGBM mapping to corrective premiums
source('___lgb_mapping_to_corr.R') 

corr_mapping_lgb <- setNames(nm = names(sims)) %>%
  lapply(the_mapping_to_corr_lightgbm_fun)
Mapping to corr. for scenario:  Scenario1  
Data import: 0.3252981  sec. 
Data conversion: 0.01066899  sec. 
Lgb training : 17.72908  sec. 
Mapping to corr. for scenario:  Scenario2  
Data import: 0.296618  sec. 
Data conversion: 0.01664019  sec. 
Lgb training : 8.731765  sec. 
Mapping to corr. for scenario:  Scenario3  
Data import: 0.2445309  sec. 
Data conversion: 0.02969599  sec. 
Lgb training : 21.20496  sec. 

2.5 Hyperaware premium

The hyperaware premium, \(\widehat{\mu}^H(\mathbf{x})\), is the best non-directly-discriminatory approximation of the corrective premium. We train a LightGBM model using only \(\mathbf{X}\) to predict the corrective premium, targeting demographic parity without direct reliance on \(D\):
\[\begin{equation*} \widehat{\mu}^H(\mathbf{x}) = E\{\widehat{\mu}^C(\mathbf{x}, D) \mid \mathbf{X} = \mathbf{x}\}. \end{equation*}\]
Since the response variable is a premium, the MSE is an appropriate loss function in most cases. As with the unaware premium, one can instead explicitly aggregate the corrective premiums across protected groups, using the estimated propensities as weights; this is the approach we take here.
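That explicit aggregation can be sketched as follows, for one fixed profile \(\mathbf{x}\) (hypothetical corrective premiums and propensity estimates, for illustration only):

```python
import numpy as np

# Hypothetical corrective premiums mu_C(x, d) for d = 0, 1 at a fixed x
mu_C = np.array([101.0, 95.0])

# Hypothetical estimated propensities Pr(D = d | X = x)
p_d_given_x = np.array([0.3, 0.7])

# Hyperaware premium: E[mu_C(x, D) | X = x], free of any direct use of D
mu_H = float(mu_C @ p_d_given_x)
```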

2.6 Visualizing the estimated spectrum

By combining all the trained models as components of the estimated spectrum of fairness, we can define the premium functions that will serve to predict the premiums.

Defining the estimated premium and metrics functions
premiums <- setNames(nm = names(sims)) %>% lapply(function(name){
  premium_generator(best = best_lgb[[name]]$pred_fun, 
                  pdx = pdx_lgb[[name]]$pred_fun, 
                  maps_to_corr = corr_mapping_lgb[[name]], 
                  marg = marg_dist[[name]])
})

quants <- setNames(nm = names(sims)) %>% lapply(function(name){
  quant_generator(premiums = premiums[[name]])
})


## For experimental data (future use in section 5)
premiums_sims <- setNames(obj = seq_along(sim_samples),
                        nm = names(sim_samples)) %>% lapply(function(idx){
  premium_generator(best = best_sims[[idx]]$pred_fun, 
                  pdx = pdx_sims[[idx]]$pred_fun, 
                  maps_to_corr = corr_mapping_lgb[['Scenario1']], ## Placeholder: not actually used, since Section 5 focuses on proxy vulnerability.
                  marg = marg_dist_sims[[idx]])
})

quants_sims <- setNames(obj = seq_along(sim_samples),
                        nm = names(sim_samples)) %>% lapply(function(idx){
  quant_generator(premiums = premiums_sims[[idx]])
})
Computation of estimated premiums and local metrics across the grid
df_to_g_file <- "preds/df_to_g.json"

# Check if the JSON file exists
if (file.exists(df_to_g_file)) {
  message(sprintf("[%s] File exists. Reading df_to_g from %s", Sys.time(), df_to_g_file))
  df_to_g <- fromJSON(df_to_g_file)
} else {

## From the first section of the online supplement
preds_grid_stats_theo <- fromJSON('preds/preds_grid_stats_theo.json')
  
df_to_g <- setNames(nm = names(sims)) %>% lapply(function(name) {
  message(sprintf("[%s] Processing: %s", Sys.time(), name))
  
  local_scenario_df <- preds_grid_stats_theo[[name]]
  
  # Start time for this scenario
  start_time <- Sys.time()
  
  # Step 1: Compute premiums
  message(sprintf("[%s] Step 1: Computing premiums", Sys.time()))
  premium_results <- setNames(nm = levels_for_premiums) %>%
    sapply(function(s) {
      message(sprintf("[%s] Computing premium: %s", Sys.time(), s))
      premiums[[name]][[s]](
        x1 = local_scenario_df$x1,
        x2 = local_scenario_df$x2,
        d = local_scenario_df$d
      )
    })
  
  # Step 2: Compute PDX
  message(sprintf("[%s] Step 2: Computing PDX", Sys.time()))
  pdx_results <- pdx_lgb[[name]]$pred_fun(
    data.frame(
      X1 = local_scenario_df$x1,
      X2 = local_scenario_df$x2,
      D = local_scenario_df$d
    )
  )
  
  # Combine results
  message(sprintf("[%s] Step 3: Combining results", Sys.time()))
  result <- data.frame(
    local_scenario_df,
    premium_results,
    pdx = pdx_results
  )
  
  # Log completion time
  end_time <- Sys.time()
  message(sprintf("[%s] Finished processing: %s (Duration: %.2f seconds)", end_time, name, as.numeric(difftime(end_time, start_time, units = "secs"))))
  
  return(result)
})

# Save the entire df_to_g object to JSON
  message(sprintf("[%s] Saving df_to_g to %s", Sys.time(), df_to_g_file))
  toJSON(df_to_g, pretty = TRUE, auto_unbox = TRUE) %>% write(df_to_g_file)
  rm(preds_grid_stats_theo)
}

grid_stats_path <- 'preds/preds_grid_stats.json'

# Check and load or compute preds_grid_stats
if (file.exists(grid_stats_path)) {
  preds_grid_stats <- fromJSON(grid_stats_path)
} else {
  preds_grid_stats <- setNames(nm = names(df_to_g)) %>% 
    lapply(function(name) {
      data.frame(df_to_g[[name]], 
                 setNames(nm = levels_for_quants) %>% 
                       sapply(function(s) {
                         quants[[name]][[s]](x1 = df_to_g[[name]]$x1,
                                             x2 = df_to_g[[name]]$x2,
                                             d = df_to_g[[name]]$d)
                       }))
    })
  toJSON(preds_grid_stats, pretty = TRUE, auto_unbox = TRUE) %>% 
    write(grid_stats_path)
}
Computing estimated premiums and local metrics for the simulations
pop_to_g_file <- "preds/pop_to_g.json"

# Check if the JSON file exists
if (file.exists(pop_to_g_file)) {
  message(sprintf("[%s] File exists. Reading pop_to_g from %s", Sys.time(), pop_to_g_file))
  pop_to_g <- fromJSON(pop_to_g_file)
} else {
## From the first section of the online supplement
preds_pop_stats_theo <- fromJSON('preds/preds_pop_stats_theo.json')
  
pop_to_g <- setNames(nm = names(sims)) %>% lapply(function(name) {
  message(sprintf("[%s] Processing: %s", Sys.time(), name))
  
  # Start time for this simulation
  start_time <- Sys.time()
  
  list_data <- list('train' = sims[[name]],
                    'valid' = valid[[name]],
                    'test' = test[[name]])
  
  result <- setNames(nm = names(list_data)) %>% lapply(function(nm){
    
    data <- list_data[[nm]]
    
    # Step 1: Compute premiums
  message(sprintf("[%s] Step 1: Computing premiums", Sys.time()))
  premium_results <- setNames(nm = levels_for_premiums) %>%
    sapply(function(s) {
      message(sprintf("[%s] Computing premium: %s", Sys.time(), s))
      premiums[[name]][[s]](
        x1 = data$X1,
        x2 = data$X2,
        d = data$D
      )
    })
  
  # Step 2: Compute PDX
  message(sprintf("[%s] Step 2: Computing PDX", Sys.time()))
  pdx_results <- pdx_lgb[[name]]$pred_fun(
    data.frame(
      X1 = data$X1,
      X2 = data$X2,
      D = data$D
    )
  )
  
  # Combine results
  message(sprintf("[%s] Step 3: Combining results", Sys.time()))
  data.frame(
    preds_pop_stats_theo[[name]][[nm]],
    premium_results,
    pdx = pdx_results
  )
  }) 
  
  
  # Log completion time
  end_time <- Sys.time()
  message(sprintf("[%s] Finished processing: %s (Duration: %.2f seconds)", end_time, name, as.numeric(difftime(end_time, start_time, units = "secs"))))
  
  return(result)
})

# Save the entire pop_to_g object to JSON
  message(sprintf("[%s] Saving pop_to_g to %s", Sys.time(), pop_to_g_file))
  toJSON(pop_to_g, pretty = TRUE, auto_unbox = TRUE) %>% write(pop_to_g_file)
  
  rm(preds_pop_stats_theo)
}


pop_stats_path <- 'preds/preds_pop_stats.json'

# Check and load or compute preds_pop_stats
if (file.exists(pop_stats_path)) {
  preds_pop_stats <- fromJSON(pop_stats_path)
} else {
  preds_pop_stats <- setNames(nm = names(sims)) %>% 
    lapply(function(name) {
      setNames(nm = names(pop_to_g[[name]])) %>%  
        lapply(function(set) {
          local_df <- pop_to_g[[name]][[set]]
          
          data.frame(local_df, 
                 setNames(nm = levels_for_quants) %>% 
                       sapply(function(s) {
                         quants[[name]][[s]](x1 = local_df$X1,
                                             x2 = local_df$X2,
                                             d =  local_df$D)
                       }))
        })
    })
  toJSON(preds_pop_stats, pretty = TRUE, auto_unbox = TRUE) %>% 
    write(pop_stats_path)
}
Computing estimated premiums and local metrics for the experiment
sims_to_g_file <- "preds/sims_to_g.json"

# Check if the JSON file exists
if (file.exists(sims_to_g_file)) {
  message(sprintf("[%s] File exists. Reading sims_to_g from %s", Sys.time(), sims_to_g_file))
  sims_to_g <- fromJSON(sims_to_g_file)
} else {
## From the first section of the online supplement
preds_sims_stats_theo <- fromJSON('preds/preds_sims_stats_theo.json')
sims_to_g <- setNames(object = seq_along(sim_samples), 
                      nm = names(sim_samples)) %>% lapply(function(idx) {
  message(sprintf("[%s] Processing: %s", Sys.time(), paste0(idx, '/', length(sim_samples))))
  
  samples_to_ret <- setNames(nm = names(sim_samples[[idx]])) %>% lapply(function(nm){
  data <- preds_sims_stats_theo[[idx]][[nm]]
    
    # Step 1: Compute premiums
  message(sprintf("[%s] Step 1: Computing premiums", Sys.time()))
  premium_results <- setNames(nm = levels_for_premiums) %>%
    sapply(function(s) {
      message(sprintf("[%s] Computing premium: %s", Sys.time(), s))
      premiums_sims[[idx]][[s]](
        x1 = data$X1,
        x2 = data$X2,
        d = data$D
      )
    })
  
  # Step 2: Compute PDX
  message(sprintf("[%s] Step 2: Computing PDX", Sys.time()))
  pdx_results <- pdx_sims[[idx]]$pred_fun(
    data.frame(
      X1 = data$X1,
      X2 = data$X2,
      D = data$D
    )
  )
  
  
  # Combine results
  message(sprintf("[%s] Step 3: Combining results", Sys.time()))
  result <- data.frame(
    data,
    premium_results,
    pdx = pdx_results
  )
    
  })
  
  return(samples_to_ret)
})

# Save the entire sims_to_g object to JSON
  message(sprintf("[%s] Saving sims_to_g to %s", Sys.time(), sims_to_g_file))
  toJSON(sims_to_g, pretty = TRUE, auto_unbox = TRUE) %>% write(sims_to_g_file)
}

sims_stats_path <- 'preds/preds_sims_stats.json'

# Check and load or compute preds_sims_stats
if (file.exists(sims_stats_path)) {
  preds_sims_stats <- fromJSON(sims_stats_path)
} else {
  preds_sims_stats <- setNames(nm = names(sims_to_g)) %>% 
    lapply(function(name) {
      setNames(nm = names(sims_to_g[[name]])) %>%  
        lapply(function(set) {
          local_df <- sims_to_g[[name]][[set]]
          
          data.frame(local_df, 
                 setNames(nm = levels_for_quants) %>% 
                       sapply(function(s) {
                         quants_sims[[name]][[s]](x1 = local_df$X1,
                                             x2 = local_df$X2,
                                             d =  local_df$D)
                       }))
        })
    })
  toJSON(preds_sims_stats, pretty = TRUE, auto_unbox = TRUE) %>% 
    write(sims_stats_path)
}
R code producing the estimated propensity illustration across scenarios
to_save_pdx_perpop <- names(df_to_g) %>% 
  lapply(function(name){
    cols <- the_CAS_colors
    pop_id <- which(names(df_to_g) == name)
    
    ## keep the y-axis labels on a single panel
    if(pop_id == 2){ # this panel gets the y-axis label
      the_y_scale <- ylim(0,1)
      the_y_label <- latex2exp::TeX("$\\widehat{P}(D = 1|x_1, x_2)$")
    } else { # otherwise, remove everything
      the_y_scale <- scale_y_continuous(labels = NULL, limits = c(0,1))
      the_y_label <- NULL
    }
    
    the_pops <- c('Scenario 1', 'Scenario 2', 'Scenario 3')
    
    ## lets graph
    df_to_g[[name]] %>% 
      mutate(the_population = name) %>% 
  filter(x1 >= -9, x1 <= 11,
         d == 1) %>% 
  group_by(x1, x2) %>% summarise(pdx = mean(pdx)) %>%  ungroup %>% 
  ggplot(aes(x = x1, y = pdx,
             lty = factor(x2),
             linewidth = factor(x2),
             shape = factor(x2),
             alpha = factor(x2),
             color = factor(x2))) +
  geom_line() +
  scale_linetype_manual(values = c('solid', '31', '21', '11'), name = latex2exp::TeX('$x_2$')) +
  scale_color_manual(values = cols, name = latex2exp::TeX('$x_2$')) +
  scale_linewidth_manual(values = c(2, 1, 0.85, 0.55), name = latex2exp::TeX('$x_2$')) +  
  scale_alpha_manual(values = c(0.65, 0.75, 0.85, 0.9), name = latex2exp::TeX('$x_2$')) + 
  labs(x = latex2exp::TeX("$x_1$"),
       y = the_y_label,
       title = latex2exp::TeX(the_pops[pop_id])) + 
  scale_x_continuous(breaks = c(-3:3)*3 + 1)  + # see above
  theme_minimal() + 
  the_y_scale +
  theme(plot.title = element_text(hjust=0.5))
  }) %>% ggpubr::ggarrange(plotlist = .,
                           nrow = 1,
                           widths = 15, heights = 1,
                           common.legend = T,
                           legend = 'right')

if (!dir.exists('figs')) dir.create('figs')
ggsave(filename = "figs/graph_pdx_perpop.png",
       plot = to_save_pdx_perpop,
       height = 3.25,
       width = 7.55,
       units = "in",
       device = "png", dpi = 500)
Figure 2.1: Estimated propensity in terms of \(x_1\) and \(x_2\) for simulations
R code producing the estimated best-estimate, unaware, and aware illustration.
to_save_premiumsdense_BUA_perpop <- names(df_to_g) %>% lapply(function(pop_name){
  
  ## the colors
    cols <- RColorBrewer::brewer.pal(5, 'Spectral')[1:3] %>%  colorspace::darken(0.25)
      # the_CAS_colors_full[c(6, 5, 2)]
  pop_id <- which(names(df_to_g) == pop_name)
  
  ## keep the y-axis labels on the first panel only
    if(pop_name == head(names(df_to_g), 1)){ # first panel gets the y-axis label
      the_y_scale <- scale_y_continuous(breaks = c(90, 110, 130),
                     labels = scales::dollar,
                     limits = c(90, 140))
      the_y_label <- 
  latex2exp::TeX("$\\widehat{\\mu}(x_1, x_2, d)$")
    } else { # otherwise, remove everything
      the_y_scale <- scale_y_continuous(breaks = c(90, 110, 130),
                     labels = NULL,
                     limits = c(90, 140))
      the_y_label <- NULL
    }
  
  to_illustrate <- df_to_g[[pop_name]] %>%
  reshape2::melt(measure.vars = paste0(levels_for_premiums[1:3])) %>% 
    mutate(variable = factor(variable,
                             levels = paste0(levels_for_premiums[1:3]),
                             labels = labels_for_premiums[1:3])) %>% 
  filter(x1 <= 10, x1 >= -8) 
  
  to_illustrate$value[to_illustrate$pdx < 0.1] <- NA
  
  to_illustrate %>% 
  ggplot(aes(x = x1, y = value,
             lty = factor(d),
             linewidth = factor(x2),
             shape = factor(x2),
             alpha = factor(d),
             color = factor(variable))) +
  geom_line() +
  scale_linetype_manual(values = c('solid', '32'), name = latex2exp::TeX('$d$')) +
  scale_color_manual(values = cols, name = latex2exp::TeX('$\\mu$'), labels = latex2exp::TeX) +
  scale_linewidth_manual(values = c(1.5, 1, 0.85, 0.55), name = latex2exp::TeX('$x_2$')) +  
  scale_alpha_manual(values = c(0.35, 0.7), name = latex2exp::TeX('$d$')) + 
  labs(x = latex2exp::TeX("$x_1$"), y = the_y_label,
       title = paste0('Scenario ', pop_id)) +
  scale_x_continuous(breaks = c( -3, 0, 3, 6), limits = c(-4, 7)) +
  the_y_scale +
  theme_minimal()
}) %>% ggpubr::ggarrange(plotlist = .,
                           ncol = 3,
                           widths = c(18, 15, 15), heights = 1,
                           common.legend = T,
                           legend = 'right')

ggsave(filename = "figs/graph_premiumsdense_BUA_perpop.png",
       plot = to_save_premiumsdense_BUA_perpop,
       height = 4,
       width = 10.55,
       units = "in",
       device = "png", dpi = 500)
Figure 2.2: Estimated best-estimate \(\widehat{\mu}^B\), unaware \(\widehat{\mu}^U\), and aware \(\widehat{\mu}^A\) premiums for scenarios 1, 2, and 3 in the Example as a function of \(x_1\), \(x_2\), and \(d\).
R code producing the estimated aware, hyperaware, corrective illustration.
to_save_premiumsdense_AHC_perpop <- names(df_to_g) %>% lapply(function(pop_name){
  
  ## the colors
    cols <- RColorBrewer::brewer.pal(5, 'Spectral')[3:5] %>%  colorspace::darken(0.25)
  pop_id <- which(names(df_to_g) == pop_name)
  
  ## keep the y-axis labels on the first panel only
    if(pop_name == head(names(df_to_g), 1)){ # first panel gets the y-axis label
      the_y_scale <- scale_y_continuous(breaks = c(90, 110, 130),
                     labels = scales::dollar,
                     limits = c(90, 130))
      the_y_label <- 
  latex2exp::TeX("$\\widehat{\\mu}(x_1, x_2, d)$")
    } else { # otherwise, remove everything
      the_y_scale <- scale_y_continuous(breaks = c(90, 110, 130),
                     labels = NULL,
                     limits = c(90, 130))
      the_y_label <- NULL
    }
  
  to_illustrate <- df_to_g[[pop_name]] %>%
  reshape2::melt(measure.vars = paste0(levels_for_premiums[3:5])) %>% 
    mutate(variable = factor(variable,
                             levels = paste0(levels_for_premiums[3:5]),
                             labels = labels_for_premiums[3:5])) %>% 
  filter(x1 <= 10, x1 >= -8) 
  # group_by(x1, d, variable) %>% summarise(value = mean(value)) %>%  ungroup %>% 
  
  ## mask part of the line 
  to_illustrate$value[to_illustrate$pdx < 0.1] <- NA
  
  to_illustrate %>% 
  ggplot(aes(x = x1, y = value,
             lty = factor(d),
             linewidth = factor(x2),
             shape = factor(x2),
             alpha = factor(d),
             color = factor(variable))) +
  geom_line() +
  scale_linetype_manual(values = c('solid', '32'), name = latex2exp::TeX('$d$')) +
  scale_color_manual(values = cols, name = latex2exp::TeX('$\\mu$'), labels = latex2exp::TeX) +
  scale_linewidth_manual(values = c(1.5, 1, 0.85, 0.55), name = latex2exp::TeX('$x_2$')) +  
  scale_alpha_manual(values = c(0.35, 0.7), name = latex2exp::TeX('$d$')) + 
  labs(x = latex2exp::TeX("$x_1$"), y = the_y_label,
       title = paste0('Scenario ', pop_id)) +
  scale_x_continuous(breaks = c( -3, 0, 3, 6), limits = c(-4, 7)) +
  the_y_scale +
  theme_minimal()
}) %>% ggpubr::ggarrange(plotlist = .,
                           ncol = 3,
                           widths = c(18, 15, 15),
                            heights = 1,
                           common.legend = T,
                           legend = 'right')

ggsave(filename = "figs/graph_premiumsdense_AHC_perpop.png",
       plot = to_save_premiumsdense_AHC_perpop,
       height = 4,
       width = 10.55,
       units = "in",
       device = "png", dpi = 500)
Figure 2.3: Estimated aware \(\widehat{\mu}^A\), hyperaware \(\widehat{\mu}^H\), and corrective \(\widehat{\mu}^C\) premiums for scenarios 1, 2, and 3 in the Example as a function of \(x_1\), \(x_2\), and \(d\).
Chevalier, Dominik, and Marie-Pier Côté. 2024. “From Point to Probabilistic Gradient Boosting for Claim Frequency and Severity Prediction.” arXiv Preprint arXiv:2412.14916. https://arxiv.org/abs/2412.14916.
Fernandes Machado, Agathe, Suzie Grondin, Philipp Ratz, Arthur Charpentier, and François Hu. 2025. “EquiPy: Sequential Fairness Using Optimal Transport in Python.” arXiv Preprint arXiv:2503.09866. https://arxiv.org/abs/2503.09866.
Hu, Francois, Philipp Ratz, and Arthur Charpentier. 2024. “A Sequentially Fair Mechanism for Multiple Sensitive Attributes.” Proceedings of the AAAI Conference on Artificial Intelligence 38 (11): 12502–10. https://doi.org/10.1609/aaai.v38i11.29143.
Ke, Guolin, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” Advances in Neural Information Processing Systems 30: 3146–54.
Lindholm, Mathias, Ronald Richman, Andreas Tsanakas, and Mario V Wüthrich. 2022. “Discrimination-Free Insurance Pricing.” ASTIN Bulletin 52 (1): 55–89.